The goal of this project is to utilize statistical matching methods to search for a subset of Beta clients that are representative of Release. The specific use-case of this proof-of-concept was to utilize performance, configuration, and environment covariates of the clients for matching. Validation of the matching was performed on a hold-out set of Firefox user engagement covariates.
The following tables represent the relative difference between the Beta and Release train (v67) and validation (v68) data sets for the mean and median respectively. These initial results are promising and suggest that such techniques could be applied to Mozilla use-cases.
| | active_hours | active_hours_max | uri_count | uri_count_max | search_count | search_count_max | num_pages | num_pages_max | daily_max_tabs | daily_max_tabs_max | daily_unique_domains | daily_unique_domains_max | daily_tabs_opened | daily_tabs_opened_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pre-matching: v67 | 0.0308290 | 0.0326158 | 0.0239820 | 0.0345788 | 0.0326937 | 0.0370581 | 0.0087112 | 0.0082139 | 0.5308586 | 0.4699185 | 0.0152401 | 0.0198211 | 0.1966846 | 0.1884580 |
| post-matching: v67 | 0.0612966 | 0.0568527 | 0.0454368 | 0.0384508 | 0.0176654 | 0.0029548 | 0.0648153 | 0.0638618 | 0.2885793 | 0.2740081 | 0.0523747 | 0.0456359 | 0.0633665 | 0.0663631 |
| pre-matching: v68 | 0.0606451 | 0.0998606 | 0.0728483 | 0.1234272 | 0.0480039 | 0.0946543 | 0.0829937 | 0.0840384 | 0.4287439 | 0.3486020 | 0.0082032 | 0.0291493 | 0.1734935 | 0.1202410 |
| post-matching: v68 | 0.0938118 | 0.0939657 | 0.0613485 | 0.0814801 | 0.0135225 | 0.0229419 | 0.0418302 | 0.0431463 | 0.3055825 | 0.2682372 | 0.0099754 | 0.0003550 | 0.0136131 | 0.0101047 |
| | active_hours | active_hours_max | uri_count | uri_count_max | search_count | search_count_max | num_pages | num_pages_max | daily_max_tabs | daily_max_tabs_max | daily_unique_domains | daily_unique_domains_max | daily_tabs_opened | daily_tabs_opened_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pre-matching: v67 | 0.0839799 | 0.0826347 | 0.1046832 | 0.1269036 | 0.0476190 | 0 | 0.2401578 | 0.2372583 | 0.1333333 | 0 | 0.0104167 | 0.0833333 | 0.0000000 | 0.0000000 |
| post-matching: v67 | 0.1032947 | 0.0886228 | 0.1100945 | 0.1116751 | 0.0857143 | 0 | 0.3017174 | 0.2972759 | 0.0285714 | 0 | 0.0676294 | 0.1515152 | 0.0740741 | 0.0588235 |
| pre-matching: v68 | 0.1224242 | 0.1720430 | 0.1718107 | 0.2323232 | 0.2500000 | 0 | 0.3676804 | 0.3570644 | 0.1000000 | 0 | 0.0411921 | 0.1333333 | 0.0285714 | 0.1176471 |
| post-matching: v68 | 0.1331136 | 0.1338028 | 0.1317805 | 0.1308017 | 0.0000000 | 0 | 0.2223247 | 0.2144493 | 0.0322581 | 0 | 0.0142857 | 0.0769231 | 0.1000000 | 0.1000000 |
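The relative differences in these tables are the absolute gap between the Beta and Release summary statistic, normalized by the Release value; this can be checked against the per-group summary tables later in the report. A minimal sketch (the analysis itself was carried out in R; this Python version is only illustrative):

```python
def relative_difference(beta, release):
    """Absolute difference between a Beta and a Release summary
    statistic, normalized by the Release value."""
    return abs(beta - release) / release

# Example: post-matching v67 active_hours means from the tables above.
delta = relative_difference(0.7977679, 0.8498615)
print(round(delta, 7))  # matches the reported 0.0612966
```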
Each new Release version of Firefox is available as a prerelease before it is launched to the general community. It is highly desirable to utilize the telemetry from Beta versions of Firefox to determine critical aspects of Release behavior before launch to the general user base. However, it is well known that the Beta population has distinctly different characteristics from Release, such as a different distribution of its users' countries of origin and a higher incidence of crashes. Therefore, directly utilizing Beta telemetry to inform Release is not statistically valid.
One possible approach to deal with this discrepancy is to use statistical matching techniques to find a subset of Beta that is representative of Release. What constitutes “representative” depends upon the desired use-case (outcome), such as performance characteristics or crash rates. In this work, we focus on user engagement metrics as the chosen use-case. Here, we follow two different strategies to validate the resultant model:
Balancing, in this case, yields a set of client_ids for Beta that resembles Release. Our application then queries the current Beta data (version N+1) for these client_ids, and calculates the metrics we care about from the covariates we care about. This gives us an idea of how these users do indeed change in time.

Our methodology aims at defining which aspect of Release behavior we will address with the Beta subset (e.g., start-up, user engagement, browser responsiveness) and then determining and focusing on the statistical matching approaches that address the chosen use-case. The methodology is summarized as follows:
The following filters are applied:
| | rows | columns | discrete_columns | continuous_columns | all_missing_columns | total_missing_values | complete_rows | total_observations | memory_usage |
|---|---|---|---|---|---|---|---|---|---|
| v67 | 302819 | 97 | 8 | 89 | 0 | 0 | 302819 | 29373443 | 179294160 |
| v68 | 328042 | 97 | 8 | 89 | 0 | 0 | 328042 | 31820074 | 192914864 |
The following covariates were collected. They were categorized as training and hold-out: the former were used for training a statistical matching model, while the latter were excluded from training and used to evaluate model performance. The covariates are further subcategorized by what they measure.
The following makes up the training data set, used in statistical matching:
The following constitutes the validation data set:
In this step, we perform feature engineering, encoding the categorical data as dummy variables, in case some of these categories turn out to be determinant factors when imputing other variables. Specifically, we employ two widely used techniques:
Through these techniques, we end up with larger training and validation data sets. The following reports show the differences between those data sets pre (df_train_f) and post (df_train_encoder) feature engineering.
| | data.frame | ncol | nrow |
|---|---|---|---|
| pre-engineering | df_train_f | 63 | 302819 |
| post-engineering | df_train_encoder | 97 | 302819 |
| | variable | position | class |
|---|---|---|---|
| 52 | V1 | 1 | character |
| 53 | fxa_configured | 29 | logical |
| 54 | sync_configured | 30 | logical |
| 55 | is_default_browser | 31 | character |
| 56 | locale | 32 | character |
| 57 | normalized_channel | 33 | integer |
| 58 | default_search_engine | 35 | character |
| 59 | country | 36 | integer |
| 60 | cpu_vendor | 42 | integer |
| 61 | is_wow64 | 45 | numeric |
| 62 | distro_id_norm | 53 | character |
| 63 | timezone_cat | 54 | character |
| 64 | cpu_l2_cache_kb_cat | 59 | character |
| 65 | label_beta | 25 | integer |
| 66 | label_release | 26 | integer |
| 67 | fxa_configured_False | 29 | integer |
| 68 | fxa_configured_True | 30 | integer |
| 69 | sync_configured_False | 31 | integer |
| 70 | sync_configured_True | 32 | integer |
| 71 | is_default_browser_False | 33 | integer |
| 72 | is_default_browser_True | 34 | integer |
| 73 | locale_en.GB | 35 | integer |
| 74 | locale_en.US | 36 | integer |
| 75 | normalized_channel_beta | 37 | integer |
| 76 | normalized_channel_release | 38 | integer |
| 77 | default_search_engine_Bing | 40 | integer |
| 78 | default_search_engine_DuckDuckGo | 41 | integer |
| 79 | default_search_engine_Google | 42 | integer |
| 80 | default_search_engine_other..bundled. | 43 | integer |
| 81 | default_search_engine_other..non.bundled. | 44 | integer |
| 82 | default_search_engine_Yahoo | 45 | integer |
| 83 | country_GB | 46 | integer |
| 84 | country_US | 47 | integer |
| 85 | cpu_vendor_AMD | 53 | integer |
| 86 | cpu_vendor_Intel | 54 | integer |
| 87 | cpu_vendor_Other | 55 | integer |
| 88 | is_wow64_False | 58 | integer |
| 89 | is_wow64_True | 59 | integer |
| 90 | distro_id_norm_acer | 67 | integer |
| 91 | distro_id_norm_Mozilla | 68 | integer |
| 92 | distro_id_norm_other | 69 | integer |
| 93 | distro_id_norm_Yahoo | 70 | integer |
| 94 | timezone_cat_..12..10. | 71 | integer |
| 95 | timezone_cat_..10..8. | 72 | integer |
| 96 | timezone_cat_..8..6. | 73 | integer |
| 97 | timezone_cat_..6..4. | 74 | integer |
| 98 | timezone_cat_..4..2. | 75 | integer |
| 99 | timezone_cat_..2.0. | 76 | integer |
| 100 | timezone_cat_.0.2. | 77 | integer |
| 101 | timezone_cat_.2.4. | 78 | integer |
| 102 | timezone_cat_.4.6. | 79 | integer |
| 103 | timezone_cat_.6.8. | 80 | integer |
| 104 | timezone_cat_.8.10. | 81 | integer |
| 105 | timezone_cat_.10.12. | 82 | integer |
| 106 | timezone_cat_.12.14. | 83 | integer |
| 107 | cpu_l2_cache_kb_cat_..1024 | 88 | integer |
| 108 | cpu_l2_cache_kb_cat_..256 | 89 | integer |
| 109 | cpu_l2_cache_kb_cat_..512 | 90 | integer |
| 110 | cpu_l2_cache_kb_cat_..1024.1 | 91 | integer |
| 111 | default_search_engine_missing | 96 | numeric |
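The expansion from 63 to 97 columns shown above comes from turning each categorical or boolean covariate into 0/1 indicator columns. A minimal pandas sketch of that step, assuming `get_dummies` as a stand-in for the encoder actually used (the toy values below are illustrative):

```python
import pandas as pd

# Toy frame with two covariates mirroring columns in the report above.
df = pd.DataFrame({
    "cpu_vendor": ["Intel", "AMD", "Other", "Intel"],
    "is_wow64": [True, False, True, True],
})

# Expand each listed column into 0/1 indicator (dummy) columns,
# producing names like cpu_vendor_Intel, is_wow64_True as in the report.
encoded = pd.get_dummies(df, columns=["cpu_vendor", "is_wow64"])
print(sorted(encoded.columns))
```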
| | data.frame | ncol | nrow |
|---|---|---|---|
| pre-engineering | df_validate_f | 63 | 328042 |
| post-engineering | df_validate_encoder | 97 | 328042 |
| | variable | position | class |
|---|---|---|---|
| 52 | V1 | 1 | character |
| 53 | fxa_configured | 29 | logical |
| 54 | sync_configured | 30 | logical |
| 55 | is_default_browser | 31 | character |
| 56 | locale | 32 | character |
| 57 | normalized_channel | 33 | integer |
| 58 | default_search_engine | 35 | character |
| 59 | country | 36 | integer |
| 60 | cpu_vendor | 42 | integer |
| 61 | is_wow64 | 45 | numeric |
| 62 | distro_id_norm | 53 | character |
| 63 | timezone_cat | 54 | character |
| 64 | cpu_l2_cache_kb_cat | 59 | character |
| 65 | label_beta | 25 | integer |
| 66 | label_release | 26 | integer |
| 67 | fxa_configured_False | 29 | integer |
| 68 | fxa_configured_True | 30 | integer |
| 69 | sync_configured_False | 31 | integer |
| 70 | sync_configured_True | 32 | integer |
| 71 | is_default_browser_False | 33 | integer |
| 72 | is_default_browser_True | 34 | integer |
| 73 | locale_en.GB | 35 | integer |
| 74 | locale_en.US | 36 | integer |
| 75 | normalized_channel_beta | 37 | integer |
| 76 | normalized_channel_release | 38 | integer |
| 77 | default_search_engine_Bing | 40 | integer |
| 78 | default_search_engine_DuckDuckGo | 41 | integer |
| 79 | default_search_engine_Google | 42 | integer |
| 80 | default_search_engine_missing | 43 | integer |
| 81 | default_search_engine_other..bundled. | 44 | integer |
| 82 | default_search_engine_other..non.bundled. | 45 | integer |
| 83 | default_search_engine_Yahoo | 46 | integer |
| 84 | country_GB | 47 | integer |
| 85 | country_US | 48 | integer |
| 86 | cpu_vendor_AMD | 54 | integer |
| 87 | cpu_vendor_Intel | 55 | integer |
| 88 | cpu_vendor_Other | 56 | integer |
| 89 | is_wow64_False | 59 | integer |
| 90 | is_wow64_True | 60 | integer |
| 91 | distro_id_norm_acer | 68 | integer |
| 92 | distro_id_norm_Mozilla | 69 | integer |
| 93 | distro_id_norm_other | 70 | integer |
| 94 | distro_id_norm_Yahoo | 71 | integer |
| 95 | timezone_cat_..12..10. | 72 | integer |
| 96 | timezone_cat_..10..8. | 73 | integer |
| 97 | timezone_cat_..8..6. | 74 | integer |
| 98 | timezone_cat_..6..4. | 75 | integer |
| 99 | timezone_cat_..4..2. | 76 | integer |
| 100 | timezone_cat_..2.0. | 77 | integer |
| 101 | timezone_cat_.0.2. | 78 | integer |
| 102 | timezone_cat_.2.4. | 79 | integer |
| 103 | timezone_cat_.4.6. | 80 | integer |
| 104 | timezone_cat_.6.8. | 81 | integer |
| 105 | timezone_cat_.8.10. | 82 | integer |
| 106 | timezone_cat_.10.12. | 83 | integer |
| 107 | timezone_cat_.12.14. | 84 | integer |
| 108 | cpu_l2_cache_kb_cat_..1024 | 89 | integer |
| 109 | cpu_l2_cache_kb_cat_..256 | 90 | integer |
| 110 | cpu_l2_cache_kb_cat_..512 | 91 | integer |
| 111 | cpu_l2_cache_kb_cat_..1024.1 | 92 | integer |
As statistical matching typically trains a machine learning (ML) model to calculate propensity scores, variable selection should be employed. However, the literature suggests that the typical ML technique of limiting covariates to those that best predict the response is not helpful for statistical matching. Rather, covariates that are unrelated to the exposure (i.e., Beta or Release) but related to the outcome (i.e., user engagement metrics) should always be included in a propensity score model.
Therefore, in this step, we apply the Boruta algorithm as an initial pre-filter on the covariates, to narrow the feature selection search space. The Boruta algorithm is a wrapper built around the random forest classification algorithm that tries to capture all features in the data set that are relevant to an outcome variable. In short, we apply the Boruta algorithm with each user engagement metric as the model outcome; for each outcome, the algorithm verifies whether a feature is important or not. Finally, we take the top-5 and top-10 ranked features per metric and add them to lists. We also perform this same process on a class-balanced data set.
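The core of Boruta is to compare each real feature against "shadow" features obtained by permuting the real ones: a feature is confirmed only if it consistently beats the best shadow. A toy numpy illustration of that idea, using absolute correlation with the outcome as a stand-in for the random forest importance Boruta actually uses (the data and names here are invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

def boruta_like_screen(X, y, n_rounds=50):
    """Toy version of Boruta's shadow-feature test: keep a feature if
    its importance (here |correlation| with y, standing in for random
    forest importance) beats the best permuted 'shadow' feature in a
    majority of rounds."""
    p = X.shape[1]
    imp = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    hits = np.zeros(p)
    for _ in range(n_rounds):
        shadows = rng.permuted(X, axis=0)  # permuting destroys real signal
        shadow_imp = np.abs([np.corrcoef(shadows[:, j], y)[0, 1]
                             for j in range(p)])
        hits += imp > shadow_imp.max()
    return hits / n_rounds > 0.5

# Two informative features and three pure-noise features.
X = rng.normal(size=(500, 5))
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=500)
print(boruta_like_screen(X, y)[:2])  # both informative features confirmed
```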
Hence, in total, we built four covariate sets, which we call experiments:
For the last step of our methodology, we use statistical matching methods to search for a subset of Beta clients that is representative of Release. To do this, a range of statistical matching models was reviewed, using the R library MatchIt:
Nearest-neighbor matching selects the k (default = 1) best control matches for each individual in the treatment group, using a distance measure specified by the distance option (default = logit). In short, at each matching step the method chooses the control unit that is not yet matched but is closest to the treated unit on the distance measure.

Before propensity scores are calculated, we define six covariate sets (experiments) to be utilized in the model selection. The first four experiments were obtained under the conditions described above. One of the remaining experiments was constructed using statistical tests: we compute a normalized difference, a traditional statistical approach that calculates the standardized difference between the control and treatment groups for every variable included in the selection model. In this case, absolute scores higher than 25% are considered suspect and may indicate an imbalance for that specific variable. Variables that create imbalance should be included in the selection model. For the last experiment, we considered all the variables present in the data set (except for user engagement).
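Both the greedy nearest-neighbor step and the normalized-difference balance check can be sketched in a few lines. This is a numpy illustration of the two ideas, not MatchIt's actual implementation (the scores and covariate values below are made up):

```python
import numpy as np

def greedy_nearest_neighbor(treated_scores, control_scores):
    """For each treated unit, pick the closest not-yet-matched control
    unit on the distance measure (e.g., a propensity score)."""
    available = list(range(len(control_scores)))
    matches = []
    for t in treated_scores:
        j = min(available, key=lambda i: abs(control_scores[i] - t))
        matches.append(j)
        available.remove(j)  # matching without replacement
    return matches

def normalized_difference(treated, control):
    """Standardized mean difference; |value| > 0.25 flags imbalance."""
    num = treated.mean() - control.mean()
    den = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    return num / den

t = np.array([0.8, 0.3])
c = np.array([0.75, 0.35, 0.5])
print(greedy_nearest_neighbor(t, c))  # [0, 1]

cov_t = np.array([1.0, 1.2, 0.9, 1.1])
cov_c = np.array([1.0, 1.1, 0.95, 1.05])
print(round(normalized_difference(cov_t, cov_c), 3))  # 0.245, below 0.25
```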
Finally, a range of Beta overrepresentations was tested. In the following, "2x" means roughly twice as many Beta samples as Release samples.
1x Beta to Release (50% - 50%)
2x Beta to Release (70% - 30%)
4x Beta to Release (80% - 20%)
The best-performing model was trained on the v67 data set and has the following properties:
experiment 3

In this application, we need to balance the two groups (Beta and Release) on the other covariates (e.g., environment and performance metrics) and then look at the difference in user engagement metrics between the balanced Beta and Release for that version (N). The utility of this application is to inform us how Beta differs from Release in user engagement, with all the other covariates being equal.
| | active_hours | active_hours_max | uri_count | uri_count_max | search_count | search_count_max | num_pages | num_pages_max | daily_max_tabs | daily_max_tabs_max | daily_unique_domains | daily_unique_domains_max | daily_tabs_opened | daily_tabs_opened_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| beta (mean) | 0.7977679 | 1.5379846 | 149.3878515 | 309.7738724 | 2.3311445 | 5.4187111 | 1.638074e+04 | 1.657372e+04 | 8.0837222 | 11.9707020 | 4.7234374 | 8.1824028 | 18.2089827 | 35.5746839 |
| release (mean) | 0.8498615 | 1.6306940 | 156.4986413 | 322.1612357 | 2.3730655 | 5.4347695 | 1.751605e+04 | 1.770435e+04 | 6.2733605 | 9.3960957 | 4.9844993 | 8.5736703 | 17.1239004 | 33.3607594 |
| delta (mean) | 0.0612966 | 0.0568527 | 0.0454368 | 0.0384508 | 0.0176654 | 0.0029548 | 6.481530e-02 | 6.386180e-02 | 0.2885793 | 0.2740081 | 0.0523747 | 0.0456359 | 0.0633665 | 0.0663631 |
| beta (median) | 0.5197569 | 1.0569444 | 86.1428571 | 175.0000000 | 0.8000000 | 2.0000000 | 3.846560e+03 | 3.998500e+03 | 3.8571429 | 6.0000000 | 3.3565341 | 5.0909091 | 8.3333333 | 16.0000000 |
| release (median) | 0.5796296 | 1.1597222 | 96.8000000 | 197.0000000 | 0.8750000 | 2.0000000 | 5.508600e+03 | 5.690000e+03 | 3.7500000 | 6.0000000 | 3.6000000 | 6.0000000 | 9.0000000 | 17.0000000 |
| delta (median) | 0.1032947 | 0.0886228 | 0.1100945 | 0.1116751 | 0.0857143 | 0.0000000 | 3.017174e-01 | 2.972759e-01 | 0.0285714 | 0.0000000 | 0.0676294 | 0.1515152 | 0.0740741 | 0.0588235 |
| metric | label | active_hours | active_hours_max | uri_count | uri_count_max | search_count | search_count_max | num_pages | num_pages_max | daily_max_tabs | daily_max_tabs_max | daily_unique_domains | daily_unique_domains_max | daily_tabs_opened | daily_tabs_opened_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | beta | 0.8236611 | 1.577508 | 152.74550 | 311.0213 | 2.4506498 | 5.636171 | 17363.463 | 17558.93 | 9.603628 | 13.811495 | 5.060464 | 8.743610 | 20.491908 | 39.64786 |
| mean | beta - matched | 0.7977679 | 1.537985 | 149.38785 | 309.7739 | 2.3311445 | 5.418711 | 16380.741 | 16573.72 | 8.083722 | 11.970702 | 4.723437 | 8.182403 | 18.208983 | 35.57468 |
| mean | release | 0.8498615 | 1.630694 | 156.49864 | 322.1612 | 2.3730655 | 5.434769 | 17516.049 | 17704.35 | 6.273360 | 9.396096 | 4.984499 | 8.573670 | 17.123900 | 33.36076 |
| median | beta | 0.5309524 | 1.063889 | 86.66667 | 172.0000 | 0.8333333 | 2.000000 | 4185.667 | 4340.00 | 4.250000 | 6.000000 | 3.562500 | 5.500000 | 9.000000 | 17.00000 |
| median | beta - matched | 0.5197569 | 1.056944 | 86.14286 | 175.0000 | 0.8000000 | 2.000000 | 3846.560 | 3998.50 | 3.857143 | 6.000000 | 3.356534 | 5.090909 | 8.333333 | 16.00000 |
| median | release | 0.5796296 | 1.159722 | 96.80000 | 197.0000 | 0.8750000 | 2.000000 | 5508.600 | 5690.00 | 3.750000 | 6.000000 | 3.600000 | 6.000000 | 9.000000 | 17.00000 |
In short, we want to know if there is any significant difference between the average user engagement metrics in the Beta and Release groups for the same version (v67). Here, we use the unpaired two-sample Wilcoxon test (also known as the Mann-Whitney test), a non-parametric alternative to the unpaired two-sample t-test for comparing two independent groups of samples. Our question is: is there any significant difference between Beta (v67) and Release (v67) user engagement metrics?
If the resulting p-values are less than the significance level \(\alpha = 0.05\), we can conclude that Beta's user engagement metrics are, on average, significantly different from those of Release users.
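The report's tests were presumably run with R's wilcox.test; the following self-contained Python sketch shows the same rank-sum idea with a normal approximation to the null distribution (no tie correction, which is adequate for continuous metrics like active_hours; the sample data are invented):

```python
import math
import numpy as np

def wilcoxon_rank_sum(x, y):
    """Unpaired two-sample Wilcoxon (Mann-Whitney) test, two-sided
    p-value via the normal approximation; assumes no ties."""
    n1, n2 = len(x), len(y)
    combined = np.concatenate([x, y])
    ranks = combined.argsort().argsort() + 1      # ranks 1..n1+n2
    u = ranks[:n1].sum() - n1 * (n1 + 1) / 2      # U statistic for x
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 300)
b = rng.normal(0.5, 1.0, 300)        # clearly shifted distribution
print(wilcoxon_rank_sum(a, b) < 0.05)  # True: significant difference
```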
| | p_value | diff |
|---|---|---|
| active_hours | 0.0032274 | TRUE |
| active_hours_max | 0.0028748 | TRUE |
| uri_count | 0.0431074 | TRUE |
| uri_count_max | 0.0117049 | TRUE |
| search_count | 0.3103178 | FALSE |
| search_count_max | 0.2549964 | FALSE |
| num_pages | 0.0003315 | TRUE |
| num_pages_max | 0.0003186 | TRUE |
| daily_max_tabs | 0.2051756 | FALSE |
| daily_max_tabs_max | 0.1004788 | FALSE |
| daily_unique_domains | 0.0038884 | TRUE |
| daily_unique_domains_max | 0.0043621 | TRUE |
| daily_tabs_opened | 0.8696248 | FALSE |
| daily_tabs_opened_max | 0.4046741 | FALSE |
For a graphical comparison, we plot the holdout covariate distributions for the following subsets:
NOTE: Guiding lines have been added for the following:
| | daily_num_sessions_started | daily_num_sessions_started_max | FX_PAGE_LOAD_MS_2_PARENT | memory_mb | num_active_days | num_addons | num_bookmarks | profile_age | session_length | session_length_max | TIME_TO_DOM_COMPLETE_MS | TIME_TO_DOM_CONTENT_LOADED_END_MS | TIME_TO_DOM_INTERACTIVE_MS | TIME_TO_LOAD_EVENT_END_MS | TIME_TO_NON_BLANK_PAINT_MS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| beta (mean) | 2.7321709 | 4.9926873 | 3247.1869234 | 9579.7998207 | 5.5523212 | 7.1386288 | 233.7638860 | 908.3743631 | 10.7072617 | 20.4610988 | 3851.5736062 | 2566.7619973 | 2102.7159586 | 3571.3891057 | 1657.5890160 |
| release (mean) | 2.8754903 | 5.2335855 | 3027.8108054 | 9447.2322438 | 5.5698425 | 5.6691919 | 160.4287272 | 896.5396045 | 9.3214724 | 18.2702112 | 3294.6494654 | 2296.3124719 | 1797.2162710 | 3019.4899187 | 1443.3046942 |
| delta (mean) | 0.0498417 | 0.0460293 | 0.0724537 | 0.0140324 | 0.0031457 | 0.2591969 | 0.4571199 | 0.0132005 | 0.1486664 | 0.1199158 | 0.1690390 | 0.1177756 | 0.1699849 | 0.1827789 | 0.1484678 |
| beta (median) | 1.8000000 | 3.0000000 | 2779.5305788 | 8058.0000000 | 6.0000000 | 6.0000000 | 24.9166667 | 721.0000000 | 6.4794677 | 12.2702780 | 2646.8827684 | 1717.7745048 | 1463.2633003 | 2452.7563807 | 1160.6414141 |
| release (median) | 2.0000000 | 4.0000000 | 2649.1832061 | 8071.0000000 | 6.0000000 | 5.0000000 | 26.0000000 | 704.0000000 | 6.4109720 | 11.7819440 | 2495.8947368 | 1628.9545455 | 1376.1397059 | 2298.4834123 | 1091.0504202 |
| delta (median) | 0.1000000 | 0.2500000 | 0.0492029 | 0.0016107 | 0.0000000 | 0.2000000 | 0.0416667 | 0.0241477 | 0.0106841 | 0.0414477 | 0.0604946 | 0.0545257 | 0.0633101 | 0.0671195 | 0.0637835 |
| metric | label | daily_num_sessions_started | daily_num_sessions_started_max | FX_PAGE_LOAD_MS_2_PARENT | memory_mb | num_active_days | num_addons | num_bookmarks | profile_age | session_length | session_length_max | TIME_TO_DOM_COMPLETE_MS | TIME_TO_DOM_CONTENT_LOADED_END_MS | TIME_TO_DOM_INTERACTIVE_MS | TIME_TO_LOAD_EVENT_END_MS | TIME_TO_NON_BLANK_PAINT_MS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | beta | 2.368895 | 4.281399 | 3463.708 | 8965.156 | 5.346169 | 7.855376 | 242.48780 | 893.7534 | 12.296199 | 22.70657 | 4388.914 | 2737.638 | 2404.393 | 4126.912 | 1833.656 |
| mean | beta - matched | 2.732171 | 4.992687 | 3247.187 | 9579.800 | 5.552321 | 7.138629 | 233.76389 | 908.3744 | 10.707262 | 20.46110 | 3851.574 | 2566.762 | 2102.716 | 3571.389 | 1657.589 |
| mean | release | 2.875490 | 5.233586 | 3027.811 | 9447.232 | 5.569843 | 5.669192 | 160.42873 | 896.5396 | 9.321472 | 18.27021 | 3294.649 | 2296.312 | 1797.216 | 3019.490 | 1443.305 |
| median | beta | 1.666667 | 3.000000 | 2952.174 | 8031.000 | 6.000000 | 7.000000 | 26.00000 | 711.0000 | 7.710555 | 14.80889 | 2918.205 | 1856.036 | 1618.024 | 2739.282 | 1251.720 |
| median | beta - matched | 1.800000 | 3.000000 | 2779.531 | 8058.000 | 6.000000 | 6.000000 | 24.91667 | 721.0000 | 6.479468 | 12.27028 | 2646.883 | 1717.775 | 1463.263 | 2452.756 | 1160.641 |
| median | release | 2.000000 | 4.000000 | 2649.183 | 8071.000 | 6.000000 | 5.000000 | 26.00000 | 704.0000 | 6.410972 | 11.78194 | 2495.895 | 1628.955 | 1376.140 | 2298.483 | 1091.050 |
| | p_value | diff |
|---|---|---|
| daily_num_sessions_started | 0.0000026 | TRUE |
| daily_num_sessions_started_max | 0.0000010 | TRUE |
| FX_PAGE_LOAD_MS_2_PARENT | 0.0011319 | TRUE |
| memory_mb | 0.9551818 | FALSE |
| num_active_days | 0.8000686 | FALSE |
| num_addons | 0.0000000 | TRUE |
| num_bookmarks | 0.0075289 | TRUE |
| profile_age | 0.5966258 | FALSE |
| session_length | 0.0228950 | TRUE |
| session_length_max | 0.1650719 | FALSE |
| TIME_TO_DOM_COMPLETE_MS | 0.0004808 | TRUE |
| TIME_TO_DOM_CONTENT_LOADED_END_MS | 0.0354082 | TRUE |
| TIME_TO_DOM_INTERACTIVE_MS | 0.0002422 | TRUE |
| TIME_TO_LOAD_EVENT_END_MS | 0.0001210 | TRUE |
| TIME_TO_NON_BLANK_PAINT_MS | 0.0006373 | TRUE |
NOTE: Guiding lines have been added for the following:
Our main objective was to inform how Beta users differ from Release users in terms of user engagement, with all the other covariates being equal. From these prior analyses, we can see that there are significant differences between the two groups (Beta and Release) for some user engagement metrics, listed as follows.
num_pages
num_pages_max
active_hours
active_hours_max
uri_count
uri_count_max
daily_unique_domains
daily_unique_domains_max

In addition, by analyzing the distributions of the training covariates, we can see exactly which variables presented the biggest discrepancies, that is, in which aspects the matching was not able to balance the data sets efficiently. The most different covariates are listed as follows.
num_addons
daily_num_sessions_started_max
daily_num_sessions_started

In this application, we need to balance the Beta and Release data sets so that they resemble each other across the covariates we are concerned with, that is, the user engagement metrics. Balancing, in this case, yields a set of client_ids for Beta that resembles Release. This gives us an idea of how these users do indeed change in time. If we see changes that are larger than anticipated, then we know that something significant is happening in user engagement that we can “forecast” for the subsequent Release.
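Carrying the matched client_ids forward to the next version amounts to a simple membership filter. A pandas sketch, where the frame and column names are illustrative stand-ins for the actual data:

```python
import pandas as pd

# Hypothetical inputs: client_ids matched on v67, and v68 Beta pings.
matched_ids = pd.Series(["a1", "b2", "c3"])
df_v68 = pd.DataFrame({
    "client_id": ["a1", "b2", "c3", "d4", "e5"],
    "active_hours": [0.5, 0.7, 0.6, 0.9, 0.4],
})

# Keep only the v68 Beta rows whose client_id was matched on v67.
beta_matched_v68 = df_v68[df_v68["client_id"].isin(matched_ids)]
print(len(beta_matched_v68))  # 3 of the 5 clients retained
```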
The next step is to subset the validation v68 set by these matched Beta profiles. This reduces the Beta sample size used in the subsequent analysis:
Mean
| | active_hours | active_hours_max | uri_count | uri_count_max | search_count | search_count_max | num_pages | num_pages_max | daily_max_tabs | daily_max_tabs_max | daily_unique_domains | daily_unique_domains_max | daily_tabs_opened | daily_tabs_opened_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pre-matching | 0.0606451 | 0.0998606 | 0.0728483 | 0.1234272 | 0.0480039 | 0.0946543 | 0.0829937 | 0.0840384 | 0.4287439 | 0.3486020 | 0.0082032 | 0.0291493 | 0.1734935 | 0.1202410 |
| post-matching | 0.0938118 | 0.0939657 | 0.0613485 | 0.0814801 | 0.0135225 | 0.0229419 | 0.0418302 | 0.0431463 | 0.3055825 | 0.2682372 | 0.0099754 | 0.0003550 | 0.0136131 | 0.0101047 |
Median
| | active_hours | active_hours_max | uri_count | uri_count_max | search_count | search_count_max | num_pages | num_pages_max | daily_max_tabs | daily_max_tabs_max | daily_unique_domains | daily_unique_domains_max | daily_tabs_opened | daily_tabs_opened_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pre-matching | 0.1224242 | 0.1720430 | 0.1718107 | 0.2323232 | 0.25 | 0 | 0.3676804 | 0.3570644 | 0.1000000 | 0 | 0.0411921 | 0.1333333 | 0.0285714 | 0.1176471 |
| post-matching | 0.1331136 | 0.1338028 | 0.1317805 | 0.1308017 | 0.00 | 0 | 0.2223247 | 0.2144493 | 0.0322581 | 0 | 0.0142857 | 0.0769231 | 0.1000000 | 0.1000000 |
| metric | label | active_hours | active_hours_max | uri_count | uri_count_max | search_count | search_count_max | num_pages | num_pages_max | daily_max_tabs | daily_max_tabs_max | daily_unique_domains | daily_unique_domains_max | daily_tabs_opened | daily_tabs_opened_max |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | beta | 0.7988445 | 1.471189 | 146.33392 | 287.3003 | 2.324319 | 5.100206 | 15614.038 | 15779.79 | 9.019717 | 12.828446 | 5.148112 | 8.581990 | 20.03166 | 37.29553 |
| mean | beta - matched | 0.8681008 | 1.667582 | 162.35200 | 334.8136 | 2.538838 | 5.859896 | 20533.298 | 20742.87 | 8.533187 | 12.490330 | 5.402676 | 9.384929 | 18.95001 | 37.62945 |
| mean | release | 0.8504182 | 1.634401 | 157.83170 | 327.7541 | 2.441522 | 5.633435 | 17027.187 | 17227.57 | 6.313040 | 9.512403 | 5.106225 | 8.839660 | 17.07011 | 33.29242 |
| median | beta | 0.5027778 | 0.962500 | 80.50000 | 152.0000 | 0.750000 | 2.000000 | 3347.500 | 3513.00 | 4.125000 | 6.000000 | 3.500000 | 5.200000 | 8.50000 | 15.00000 |
| median | beta - matched | 0.5935764 | 1.195833 | 99.84524 | 206.0000 | 1.000000 | 3.000000 | 6640.125 | 6861.00 | 4.000000 | 6.000000 | 3.833333 | 6.000000 | 9.00000 | 18.00000 |
| median | release | 0.5729167 | 1.162500 | 97.20000 | 198.0000 | 1.000000 | 2.000000 | 5294.000 | 5464.00 | 3.750000 | 6.000000 | 3.650366 | 6.000000 | 8.75000 | 17.00000 |
Now, we want to know if there is any significant difference between the average user engagement metrics in the Beta-matched and Release groups across versions (v67 and v68). Once again, we use the Wilcoxon test with the following question: is there any significant difference between Beta-matched (v68) and Release (v68) user engagement metrics?
| | p_value | diff |
|---|---|---|
| active_hours | 0.8058407 | FALSE |
| active_hours_max | 0.8665102 | FALSE |
| uri_count | 0.9320963 | FALSE |
| uri_count_max | 0.9368539 | FALSE |
| search_count | 0.7363237 | FALSE |
| search_count_max | 0.4779041 | FALSE |
| num_pages | 0.0040118 | TRUE |
| num_pages_max | 0.0040251 | TRUE |
| daily_max_tabs | 0.4330177 | FALSE |
| daily_max_tabs_max | 0.7022354 | FALSE |
| daily_unique_domains | 0.7043529 | FALSE |
| daily_unique_domains_max | 0.4844754 | FALSE |
| daily_tabs_opened | 0.6283246 | FALSE |
| daily_tabs_opened_max | 0.3686337 | FALSE |
For a graphical comparison, we plot the covariate distributions for the following subsets:
NOTE: Guiding lines have been added for the following:
Mean
| | daily_num_sessions_started | daily_num_sessions_started_max | FX_PAGE_LOAD_MS_2_PARENT | memory_mb | num_active_days | num_addons | num_bookmarks | profile_age | session_length | session_length_max | TIME_TO_DOM_COMPLETE_MS | TIME_TO_DOM_CONTENT_LOADED_END_MS | TIME_TO_DOM_INTERACTIVE_MS | TIME_TO_LOAD_EVENT_END_MS | TIME_TO_NON_BLANK_PAINT_MS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pre-matching | 0.1568175 | 0.2044783 | 0.2423907 | 0.0858950 | 0.1394702 | 0.2138894 | 0.4308288 | 0.0116675 | 0.2776138 | 0.2040924 | 0.5155437 | 0.3545873 | 0.4943291 | 0.5457763 | 0.3981456 |
| post-matching | 0.1150352 | 0.1219282 | 0.0466569 | 0.0321503 | 0.0465495 | 0.1070154 | 0.3912255 | 0.0296786 | 0.2691643 | 0.2294827 | 0.1287980 | 0.1242510 | 0.1430486 | 0.1370603 | 0.1162488 |
Median
| | daily_num_sessions_started | daily_num_sessions_started_max | FX_PAGE_LOAD_MS_2_PARENT | memory_mb | num_active_days | num_addons | num_bookmarks | profile_age | session_length | session_length_max | TIME_TO_DOM_COMPLETE_MS | TIME_TO_DOM_CONTENT_LOADED_END_MS | TIME_TO_DOM_INTERACTIVE_MS | TIME_TO_LOAD_EVENT_END_MS | TIME_TO_NON_BLANK_PAINT_MS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pre-matching | 0.1666667 | 0.25 | 0.2079933 | 0.0122646 | 0.1666667 | 0.2 | 0.1153846 | 0.0237389 | 0.0635072 | 0.0239411 | 0.2968198 | 0.2541060 | 0.2905375 | 0.3073948 | 0.2432585 |
| post-matching | 0.1538462 | 0.25 | 0.0399131 | 0.0000000 | 0.0000000 | 0.2 | 0.0294118 | 0.0515971 | 0.0757924 | 0.1015121 | 0.0666033 | 0.0665543 | 0.0719666 | 0.0721201 | 0.0627417 |
| metric | label | daily_num_sessions_started | daily_num_sessions_started_max | FX_PAGE_LOAD_MS_2_PARENT | memory_mb | num_active_days | num_addons | num_bookmarks | profile_age | session_length | session_length_max | TIME_TO_DOM_COMPLETE_MS | TIME_TO_DOM_CONTENT_LOADED_END_MS | TIME_TO_DOM_INTERACTIVE_MS | TIME_TO_LOAD_EVENT_END_MS | TIME_TO_NON_BLANK_PAINT_MS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mean | beta | 2.398131 | 4.141417 | 3592.767 | 8795.994 | 4.912574 | 6.882667 | 225.4153 | 875.2575 | 12.336721 | 22.27297 | 4596.715 | 2896.937 | 2626.875 | 4375.866 | 2014.849 |
| mean | beta - matched | 2.760920 | 5.067123 | 3007.539 | 9891.627 | 5.909091 | 6.471011 | 249.5533 | 1024.7847 | 11.449578 | 21.60048 | 3393.468 | 2399.501 | 1961.143 | 3160.932 | 1568.056 |
| mean | release | 2.844142 | 5.205914 | 2891.817 | 9622.521 | 5.708778 | 5.669929 | 157.5418 | 885.5902 | 9.656064 | 18.49772 | 3033.047 | 2138.613 | 1757.896 | 2830.853 | 1441.087 |
| median | beta | 1.666667 | 3.000000 | 3044.818 | 7973.000 | 5.000000 | 6.000000 | 23.0000 | 690.0000 | 7.154375 | 12.96139 | 3024.414 | 2001.342 | 1764.381 | 2856.994 | 1358.857 |
| median | beta - matched | 1.833333 | 3.000000 | 2608.835 | 8073.000 | 6.000000 | 6.000000 | 33.0000 | 856.0000 | 7.165972 | 13.96167 | 2490.743 | 1713.148 | 1456.643 | 2329.033 | 1145.001 |
| median | release | 2.000000 | 4.000000 | 2520.559 | 8072.000 | 6.000000 | 5.000000 | 26.0000 | 674.0000 | 6.727152 | 12.65833 | 2332.177 | 1595.832 | 1367.168 | 2185.257 | 1092.980 |
Now, we want to know if there is any significant difference in the training covariates between the Beta-matched and Release groups across versions (v67 and v68). Once again, we use the Wilcoxon test with the following question: is there any significant difference between Beta-matched (v68) and Release (v68) covariates?
| | p_value | diff |
|---|---|---|
| daily_num_sessions_started | 0.3099819 | FALSE |
| daily_num_sessions_started_max | 0.3237056 | FALSE |
| FX_PAGE_LOAD_MS_2_PARENT | 0.1381689 | FALSE |
| memory_mb | 0.0632427 | FALSE |
| num_active_days | 0.0082311 | TRUE |
| num_addons | 0.0000000 | TRUE |
| num_bookmarks | 0.0000032 | TRUE |
| profile_age | 0.0000000 | TRUE |
| session_length | 0.2317669 | FALSE |
| session_length_max | 0.4197083 | FALSE |
| TIME_TO_DOM_COMPLETE_MS | 0.0173453 | TRUE |
| TIME_TO_DOM_CONTENT_LOADED_END_MS | 0.0090137 | TRUE |
| TIME_TO_DOM_INTERACTIVE_MS | 0.0976424 | FALSE |
| TIME_TO_LOAD_EVENT_END_MS | 0.0489124 | TRUE |
| TIME_TO_NON_BLANK_PAINT_MS | 0.2094423 | FALSE |
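Each p-value above can be produced with a Wilcoxon rank-sum test. A minimal sketch using SciPy, assuming the per-client metric values for the matched Beta and Release cohorts are available as numeric arrays (the data below are synthetic placeholders, not the project's data):

```python
# Sketch: Wilcoxon rank-sum test per engagement metric.
# `beta_matched` and `release` are hypothetical stand-ins for the
# per-client metric values of the two cohorts.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(42)
beta_matched = {"session_length": rng.exponential(11.4, 1000)}
release = {"session_length": rng.exponential(9.7, 1000)}

alpha = 0.05
for metric in beta_matched:
    stat, p_value = ranksums(beta_matched[metric], release[metric])
    significant = p_value < alpha  # corresponds to the `diff` column
    print(f"{metric}: p={p_value:.4f}, diff={significant}")
```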
For a graphical comparison, we plot the covariate distributions for the Beta, matched Beta, and Release subsets, with guiding lines added for reference.
Our main objective was to determine whether the user engagement metrics changed in the newest Beta version relative to the previous Release version. From these prior analyses, we can see that there are significant differences between the two groups (Beta and Release) for only two user engagement metrics, listed as follows.
- `num_pages`
- `num_pages_max`

In addition, by analyzing the distributions of the training covariates, the most different covariates are listed as follows.
- `num_addons`
- `profile_age`
- `num_bookmarks`
- `num_active_days`

In this project, we employed statistical matching methods to find subsets of Beta users that can be used to inform how Release, i.e., the general user community, will behave. Traditionally, statistical matching is used to evaluate the effect of a treatment by comparing treated and untreated units in an observational study. The purpose of matching is, for each treated unit, to find one or more untreated units with similar observable characteristics, so that outcomes among treated and untreated samples can be compared to evaluate the effect of the treatment.
Here, we use statistical matching methods in a non-traditional way: our goal is to search for a subset of Beta clients that is representative of Release. In our application, we still have two cohorts, the Beta and Release data sets for a given Firefox version (N). However, we do not have a single outcome, because there is no treatment. Rather, we are trying to equate (or balance) the Beta and Release data sets so that they resemble each other across the covariates we are concerned with. This, so to speak, is our outcome.
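The balancing step can be sketched as nearest-neighbour matching over standardized covariates; this is one common matching choice, and the report does not pin down the exact algorithm used, so the code below is an illustration with synthetic data rather than the project's implementation:

```python
# Sketch: for each Release client, pick the most similar Beta client by
# Euclidean distance over standardized covariates, without replacement.
import numpy as np

rng = np.random.default_rng(0)
beta = rng.normal(0.0, 1.0, size=(200, 3))    # Beta clients x covariates
release = rng.normal(0.2, 1.0, size=(50, 3))  # Release clients x covariates

# Standardize both groups with the pooled mean/std so distances are
# comparable across covariates with different scales.
pooled = np.vstack([beta, release])
mu, sd = pooled.mean(axis=0), pooled.std(axis=0)
beta_z, release_z = (beta - mu) / sd, (release - mu) / sd

matched_idx = []
available = set(range(len(beta_z)))
for r in release_z:
    dists = np.linalg.norm(beta_z - r, axis=1)
    # Closest still-available Beta client (matching without replacement).
    best = min(available, key=lambda i: dists[i])
    available.remove(best)
    matched_idx.append(best)

beta_matched = beta[matched_idx]  # the Beta subset resembling Release
```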
In this work, we focus on user engagement metrics as the chosen use-case. To validate the resultant matching model, we follow two different strategies. First, we balance the data sets on a set of training covariates (environment, machine configuration, performance, and usage metrics), then look at the difference in the user engagement metrics between the balanced Beta and Release sets for the same Firefox version (v67). This gives us an idea of how clients with similar environments and performance resemble Release in terms of usage. In the second strategy, we balance the data sets across the same covariates, but now over several versions (v67 and v68). This gives us an idea of how these users do indeed change over time.
For both strategies, we apply the same experimental setup; the only difference is the Firefox versions being compared. In the first strategy, our main objective was to characterize how Beta users differ from Release users in terms of user engagement, all other training covariates being equal. Our findings show that the matching worked well in general. However, for a subset of covariates, the difference between channels actually increased (e.g., num_pages, num_pages_max, daily_unique_domains, and daily_unique_domains_max), or remained relatively far from Release both before and after matching, namely active_hours, active_hours_max, uri_count, and uri_count_max. On the other hand, we observe that these Beta users are very similar to Release users with regard to search count and daily tabs opened.
In the second strategy, our main objective was to determine whether the user engagement metrics changed in the newest Beta version relative to the previous Release version. Overall, the matching yielded a subset that was about as representative of v67 as of v68 for most of the covariates reviewed. However, for a subset of covariates, the difference between channels actually decreased (num_pages, num_pages_max, daily_unique_domains, and daily_max_tabs).